These are the steps for pre-processing the IBTS data to retain just the elements needed for the analyses in the MEPS paper. They should be useful for repeating the analyses with updated data (in which case it would be good to use the icesDatras package to automate the downloading of the data), and for anyone wanting to extract data and do similar analyses. The code is not functionalised; it is better to work through the steps in an Rmarkdown-type format to understand and check what is going on.
Data were extracted from the IBTS DATRAS website by Julia Blanchard, who then undertook some initial processing, as detailed in the Supplementary Material of the MEPS paper, and added the species-specific length-weight parameters from Fung et al. (2012). The data are saved within this package as dataOrig (see data-raw/IBTS-data.R).
First, understand the data:
dim(dataOrig)
#> [1] 178435 13
names(dataOrig)
#> [1] "AphiaID" "Survey" "Year"
#> [4] "Quarter" "Area" "Species"
#> [7] "LngtClas" "CPUE_number_per_hour" "Taxonomic.group"
#> [10] "a" "b" "weight_g"
#> [13] "CPUE_bio_per_hour"
dataOrig[1:5,1:7]
#> # A tibble: 5 x 7
#> AphiaID Survey Year Quarter Area Species LngtClas
#> <int> <fct> <int> <int> <int> <fct> <int>
#> 1 101170 NS-IBTS 2003 1 4 Myxine glutinosa 360
#> 2 101170 NS-IBTS 2008 1 2 Myxine glutinosa 0
#> 3 101170 NS-IBTS 2004 1 5 Myxine glutinosa 0
#> 4 101170 NS-IBTS 2004 1 1 Myxine glutinosa 330
#> 5 101170 NS-IBTS 2013 1 4 Myxine glutinosa 330
dataOrig[1:5,8:13]
#> # A tibble: 5 x 6
#> CPUE_number_per_h~ Taxonomic.group a b weight_g CPUE_bio_per_ho~
#> <dbl> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 0.0556 Myxine glutino~ 0.0033 2.70 52.4 2.91
#> 2 0 Myxine glutino~ 0.0033 2.70 0 0
#> 3 0 Myxine glutino~ 0.0033 2.70 0 0
#> 4 0.0244 Myxine glutino~ 0.0033 2.70 41.4 1.01
#> 5 0.0909 Myxine glutino~ 0.0033 2.70 41.4 3.76
summary(dataOrig)
#> AphiaID Survey Year Quarter
#> Min. :101170 NS-IBTS:178435 Min. :1986 Min. :1
#> 1st Qu.:126436 1st Qu.:1993 1st Qu.:1
#> Median :126445 Median :2001 Median :1
#> Mean :125728 Mean :2001 Mean :1
#> 3rd Qu.:127140 3rd Qu.:2008 3rd Qu.:1
#> Max. :274304 Max. :2015 Max. :1
#>
#> Area Species LngtClas
#> Min. :1.000 Gadus morhua : 25832 Min. : 0.0
#> 1st Qu.:2.000 Amblyraja radiata : 9316 1st Qu.: 120.0
#> Median :4.000 Clupea harengus : 8919 Median : 230.0
#> Mean :3.839 Merlangius merlangus : 7676 Mean : 271.3
#> 3rd Qu.:6.000 Melanogrammus aeglefinus: 6527 3rd Qu.: 370.0
#> Max. :7.000 Pleuronectes platessa : 6403 Max. :1500.0
#> (Other) :113762
#> CPUE_number_per_hour Taxonomic.group a
#> Min. : 0.000 Gadus morhua : 25832 Min. :0.000100
#> 1st Qu.: 0.040 Amblyraja radiata : 9316 1st Qu.:0.003500
#> Median : 0.121 Conger conger : 8919 Median :0.004200
#> Mean : 11.972 Merlangius merlangus : 7676 Mean :0.006864
#> 3rd Qu.: 0.717 Melanogrammus aeglefinus: 6527 3rd Qu.:0.007100
#> Max. :7207.821 Pleuronectes platessa : 6403 Max. :0.235000
#> (Other) :113762
#> b weight_g CPUE_bio_per_hour
#> Min. :1.797 Min. : 0.00 Min. : 0.00
#> 1st Qu.:3.079 1st Qu.: 8.02 1st Qu.: 1.97
#> Median :3.156 Median : 80.87 Median : 27.94
#> Mean :3.146 Mean : 700.64 Mean : 469.33
#> 3rd Qu.:3.243 3rd Qu.: 439.08 3rd Qu.: 171.95
#> Max. :3.527 Max. :35630.04 Max. :310586.16
#>
Note that LngtClas is in mm, not cm, but that a and b are the length-weight coefficients for length in cm. We will use cm as the units later.
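As a quick check of the units, the first row printed above (LngtClas of 360 mm, a = 0.0033, b = 2.70, weight_g = 52.4) can be reproduced only if the length is first converted to cm. This is a minimal sketch using the rounded coefficients as displayed, so the agreement is only approximate; the variable names are illustrative.

```r
# Values copied from the first row of dataOrig printed above (rounded as displayed)
lengthMm = 360
a = 0.0033
b = 2.70

lengthCm = lengthMm / 10        # a and b assume length in cm
weightCm = a * lengthCm^b       # roughly 52.5 g, close to the tabulated weight_g of 52.4
weightMm = a * lengthMm^b       # wrong units: far too heavy

weightCm
weightMm / weightCm             # the error factor is 10^b
```

Using mm directly would inflate every body mass by a factor of 10^b, which is why the lengths need converting before any length-weight calculation.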
Some columns are redundant and we just want to keep the useful ones. AphiaID is a numerical code for each species. Survey and Quarter are the same for all entries. We need to know the number of areas, but don’t need to keep Area itself.
numAreas = length(unique(dataOrig$Area))
numAreas
#> [1] 7
colsKeep = c("Year",
"AphiaID",
"LngtClas",
"CPUE_number_per_hour",
"a",
"b",
"weight_g",
"CPUE_bio_per_hour")
colsDiscard = setdiff(names(dataOrig), colsKeep)
colsDiscard
#> [1] "Survey" "Quarter" "Area" "Species"
#> [5] "Taxonomic.group"
Note that data will change a lot in the following code.
data = sizeSpectra::s_select(dataOrig, colsKeep) # uses Sebastian Kranz's s_dplyr_funcs.r
data
#> # A tibble: 178,435 x 8
#> Year AphiaID LngtClas CPUE_number_per~ a b weight_g
#> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 2003 101170 360 0.0556 0.0033 2.70 52.4
#> 2 2008 101170 0 0 0.0033 2.70 0
#> 3 2004 101170 0 0 0.0033 2.70 0
#> 4 2004 101170 330 0.0244 0.0033 2.70 41.4
#> 5 2013 101170 330 0.0909 0.0033 2.70 41.4
#> 6 2013 101170 0 0 0.0033 2.70 0
#> 7 2003 101170 0 0 0.0033 2.70 0
#> 8 2012 101170 250 0.04 0.0033 2.70 19.6
#> 9 1999 101170 0 0 0.0033 2.70 0
#> 10 2003 101170 290 0.0972 0.0033 2.70 29.2
#> # ... with 178,425 more rows, and 1 more variable: CPUE_bio_per_hour <dbl>
# str(data)
summary(data)
#> Year AphiaID LngtClas CPUE_number_per_hour
#> Min. :1986 Min. :101170 Min. : 0.0 Min. : 0.000
#> 1st Qu.:1993 1st Qu.:126436 1st Qu.: 120.0 1st Qu.: 0.040
#> Median :2001 Median :126445 Median : 230.0 Median : 0.121
#> Mean :2001 Mean :125728 Mean : 271.3 Mean : 11.972
#> 3rd Qu.:2008 3rd Qu.:127140 3rd Qu.: 370.0 3rd Qu.: 0.717
#> Max. :2015 Max. :274304 Max. :1500.0 Max. :7207.821
#> a b weight_g CPUE_bio_per_hour
#> Min. :0.000100 Min. :1.797 Min. : 0.00 Min. : 0.00
#> 1st Qu.:0.003500 1st Qu.:3.079 1st Qu.: 8.02 1st Qu.: 1.97
#> Median :0.004200 Median :3.156 Median : 80.87 Median : 27.94
#> Mean :0.006864 Mean :3.146 Mean : 700.64 Mean : 469.33
#> 3rd Qu.:0.007100 3rd Qu.:3.243 3rd Qu.: 439.08 3rd Qu.: 171.95
#> Max. :0.235000 Max. :3.527 Max. :35630.04 Max. :310586.16
min(data$CPUE_number_per_hour)
#> [1] 0
So there are no negative CPUE values or spurious weights, but there are a lot of zero CPUE values.
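One simple way to quantify the zeros is to count the rows where CPUE is exactly zero and express that as a proportion. The sketch below uses only the first eight CPUE values printed above (cpueFirst8 is an illustrative name), so it does not give the figure for the full data set.

```r
# First eight CPUE_number_per_hour values from the printout of data above
cpueFirst8 = c(0.0556, 0, 0, 0.0244, 0.0909, 0, 0, 0.04)

numZero = sum(cpueFirst8 == 0)     # number of zero-CPUE rows
propZero = mean(cpueFirst8 == 0)   # proportion of zero-CPUE rows

numZero
propZero
```

Applied to the full column, the same idea would be sum(data$CPUE_number_per_hour == 0).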
We want to end up with data in a standard format (based on some original analysis I did when writing the code). This requires renaming some of the headings, converting the lengths from mm to cm, and (for convenience) ordering by Year, SpecCode and then LngtClass:
if(sum( colsKeep != c("Year", "AphiaID", "LngtClas", "CPUE_number_per_hour",
"a", "b", "weight_g", "CPUE_bio_per_hour")) > 0)
{ stop("Need to adjust renaming") }
names(data) = c("Year", "SpecCode", "LngtClass", "Number", "LWa", "LWb",
"bodyMass", "CPUE_bio_per_hour")
# CPUE_bio_per_hour is Number * bodyMass
data = dplyr::mutate(data, LngtClass = LngtClass/10)   # convert lengths from mm to cm
data = dplyr::arrange(data, Year, SpecCode, LngtClass)
data
#> # A tibble: 178,435 x 8
#> Year SpecCode LngtClass Number LWa LWb bodyMass CPUE_bio_per_hour
#> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1986 105814 0 0 0.0031 3.03 0 0
#> 2 1986 105814 0 0 0.0031 3.03 0 0
#> 3 1986 105814 0 0 0.0031 3.03 0 0
#> 4 1986 105814 0 0 0.0031 3.03 0 0
#> 5 1986 105814 0 0 0.0031 3.03 0 0
#> 6 1986 105814 0 0 0.0031 3.03 0 0
#> 7 1986 105814 0 0 0.0031 3.03 0 0
#> 8 1986 105814 0 0 0.0031 3.03 0 0
#> 9 1986 105814 0 0 0.0031 3.03 0 0
#> 10 1986 105814 0 0 0.0031 3.03 0 0
#> # ... with 178,425 more rows
That shows that we have (i) a lot of repeated values that can be amalgamated (presumably repeated because at one point the data included details about trawls, or it’s just how the data were obtained), and (ii) lots of rows with Number == 0 that we can discard, though we keep them for now since they will help verify the binning.
Ideally each row would be a unique combination of Year, SpecCode, LngtClass, but these aren’t unique. For example, looking at just one species for one year for one length class:
exampleSp = dplyr::filter(data, Year == 1986, SpecCode == 105814, LngtClass == 60)
exampleSp
#> # A tibble: 2 x 8
#> Year SpecCode LngtClass Number LWa LWb bodyMass CPUE_bio_per_hour
#> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1986 105814 60 0.0833 0.0031 3.03 754. 62.8
#> 2 1986 105814 60 0.0833 0.0031 3.03 754. 62.8
So, yes, we have multiple counts of 60 cm fish of this species, which we can just aggregate together. Do this for all years, species and lengths, dividing the summed counts by the number of areas:
data = dplyr::summarise(dplyr::group_by(data,
Year,
SpecCode,
LngtClass),
"Number" = sum(Number)/numAreas,
"LWa" = unique(LWa),
"LWb" = unique(LWb),
"bodyMass" = unique(bodyMass))
data
#> # A tibble: 49,191 x 7
#> # Groups: Year, SpecCode [2,550]
#> Year SpecCode LngtClass Number LWa LWb bodyMass
#> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1986 105814 0 0 0.0031 3.03 0
#> 2 1986 105814 10 0.0571 0.0031 3.03 3.31
#> 3 1986 105814 45 0.00714 0.0031 3.03 315.
#> 4 1986 105814 46 0.00714 0.0031 3.03 337.
#> 5 1986 105814 50 0.00714 0.0031 3.03 434.
#> 6 1986 105814 52 0.0293 0.0031 3.03 489.
#> 7 1986 105814 53 0.0109 0.0031 3.03 518.
#> 8 1986 105814 54 0.0113 0.0031 3.03 548.
#> 9 1986 105814 56 0.0218 0.0031 3.03 612.
#> 10 1986 105814 57 0.0188 0.0031 3.03 646.
#> # ... with 49,181 more rows
summary(data)
#> Year SpecCode LngtClass Number
#> Min. :1986 Min. :101170 Min. : 0.00 Min. : 0.0000
#> 1st Qu.:1993 1st Qu.:126426 1st Qu.: 14.00 1st Qu.: 0.0087
#> Median :2001 Median :126461 Median : 27.00 Median : 0.0347
#> Mean :2001 Mean :124528 Mean : 32.48 Mean : 6.2037
#> 3rd Qu.:2008 3rd Qu.:127140 3rd Qu.: 45.00 3rd Qu.: 0.2604
#> Max. :2015 Max. :274304 Max. :150.00 Max. :1591.4413
#> LWa LWb bodyMass
#> Min. :0.000100 Min. :1.797 Min. : 0.00
#> 1st Qu.:0.003400 1st Qu.:3.054 1st Qu.: 17.05
#> Median :0.004200 Median :3.147 Median : 152.50
#> Mean :0.008107 Mean :3.129 Mean : 918.59
#> 3rd Qu.:0.007800 3rd Qu.:3.243 3rd Qu.: 718.34
#> Max. :0.235000 Max. :3.527 Max. :35630.04
dplyr::filter(data, SpecCode == 105814, Year == 1986, LngtClass == 60)
#> # A tibble: 1 x 7
#> # Groups: Year, SpecCode [1]
#> Year SpecCode LngtClass Number LWa LWb bodyMass
#> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1986 105814 60 0.0238 0.0031 3.03 754.
So Number here correctly equals the sum of the two rows of exampleSp divided by seven (areas). Number is the average number (of each species and length) caught per hour of trawling across all seven areas.
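The arithmetic can be checked by hand with a small stand-alone sketch. It rebuilds the two exampleSp rows shown above as a plain data frame (exampleToy and numAreasToy are illustrative names) and uses base R's aggregate() instead of the dplyr calls, purely to keep the example self-contained.

```r
numAreasToy = 7

# The two duplicated rows of exampleSp printed above
exampleToy = data.frame(Year = c(1986, 1986),
                        SpecCode = c(105814, 105814),
                        LngtClass = c(60, 60),
                        Number = c(0.0833, 0.0833))

# Sum the counts within each Year/SpecCode/LngtClass combination ...
aggToy = aggregate(Number ~ Year + SpecCode + LngtClass,
                   data = exampleToy,
                   FUN = sum)

# ... then average over the areas
aggToy$Number = aggToy$Number / numAreasToy
aggToy
```

The single resulting row has Number of about 0.0238, matching the aggregated value printed above.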
Just confirm the calculations for bodyMass (body mass of an individual of that LngtClass) since they were done during preprocessing; we should get the same answer, using the species-specific length-weight conversions.
data = dplyr::mutate(data,
bodyMass2 = LWa * LngtClass^LWb)
if(max(abs(data$bodyMass2 - data$bodyMass)) > 0.0001) stop("Check conversions")
data = dplyr::select(data, -bodyMass2) # don't keep the confirming column
Now only include body-mass classes of at least 4 g, following Blanchard et al. (2005), since data are unreliable for smaller organisms:
range(data$LngtClass)
#> [1] 0 150
range(data$bodyMass)
#> [1] 0.00 35630.04
sum(data$bodyMass == 0) # 2549
#> [1] 2549
sum(data$bodyMass < 4 ) # 6893
#> [1] 6893
data = dplyr::filter(data, bodyMass >= 4)
range(data$bodyMass)
#> [1] 4.045665 35630.037377
data
#> # A tibble: 42,298 x 7
#> # Groups: Year, SpecCode [2,182]
#> Year SpecCode LngtClass Number LWa LWb bodyMass
#> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1986 105814 45 0.00714 0.0031 3.03 315.
#> 2 1986 105814 46 0.00714 0.0031 3.03 337.
#> 3 1986 105814 50 0.00714 0.0031 3.03 434.
#> 4 1986 105814 52 0.0293 0.0031 3.03 489.
#> 5 1986 105814 53 0.0109 0.0031 3.03 518.
#> 6 1986 105814 54 0.0113 0.0031 3.03 548.
#> 7 1986 105814 56 0.0218 0.0031 3.03 612.
#> 8 1986 105814 57 0.0188 0.0031 3.03 646.
#> 9 1986 105814 58 0.0381 0.0031 3.03 680.
#> 10 1986 105814 59 0.0327 0.0031 3.03 717.
#> # ... with 42,288 more rows
summary(data)
#> Year SpecCode LngtClass Number
#> Min. :1986 Min. :101170 Min. : 4.00 Min. : 0.0003
#> 1st Qu.:1993 1st Qu.:126436 1st Qu.: 19.00 1st Qu.: 0.0110
#> Median :2001 Median :126450 Median : 31.00 Median : 0.0387
#> Mean :2001 Mean :124117 Mean : 36.99 Mean : 5.1455
#> 3rd Qu.:2008 3rd Qu.:127140 3rd Qu.: 49.00 3rd Qu.: 0.2657
#> Max. :2015 Max. :274304 Max. :150.00 Max. :1591.4413
#> LWa LWb bodyMass
#> Min. :0.000100 Min. :1.797 Min. : 4.05
#> 1st Qu.:0.003400 1st Qu.:3.054 1st Qu.: 45.96
#> Median :0.004200 Median :3.156 Median : 246.19
#> Mean :0.008035 Mean :3.131 Mean : 1068.13
#> 3rd Qu.:0.007800 3rd Qu.:3.243 3rd Qu.: 904.47
#> Max. :0.235000 Max. :3.527 Max. :35630.04
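To see what the 4 g cut-off means in terms of length, the length-weight relationship can be inverted. The sketch below does this for species 105814 using the rounded coefficients printed above (LWa = 0.0031, LWb = 3.03); minMass and minLength are illustrative names and the result is approximate because of the rounding.

```r
LWa = 0.0031
LWb = 3.03
minMass = 4                                # body-mass threshold (g)

# Invert bodyMass = LWa * length^LWb to get the length (cm) at which mass reaches 4 g
minLength = (minMass / LWa)^(1 / LWb)
minLength

mass10 = LWa * 10^LWb                      # 10 cm class: below 4 g, so removed
mass11 = LWa * 11^LWb                      # 11 cm class: above 4 g, so would be kept
c(mass10, mass11)
```

This agrees with the printouts above: the 10 cm class for this species (bodyMass 3.31 g) is present before the filter and absent afterwards.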
Total number of fish in this dataset is
The unique length classes are:
sort(unique(data$LngtClass))
#> [1] 4.0 5.0 6.0 7.0 8.0 9.0 10.0 10.5 11.0 11.5 12.0
#> [12] 12.5 13.0 13.5 14.0 14.5 15.0 15.5 16.0 16.5 17.0 17.5
#> [23] 18.0 18.5 19.0 19.5 20.0 20.5 21.0 21.5 22.0 22.5 23.0
#> [34] 23.5 24.0 24.5 25.0 25.5 26.0 26.5 27.0 27.5 28.0 28.5
#> [45] 29.0 29.5 30.0 30.5 31.0 31.5 32.0 32.5 33.0 33.5 34.0
#> [56] 34.5 35.0 35.5 36.0 36.5 37.0 38.0 39.0 40.0 41.0 42.0
#> [67] 43.0 44.0 45.0 46.0 47.0 48.0 49.0 50.0 51.0 52.0 53.0
#> [78] 54.0 55.0 56.0 57.0 58.0 59.0 60.0 61.0 62.0 63.0 64.0
#> [89] 65.0 66.0 67.0 68.0 69.0 70.0 71.0 72.0 73.0 74.0 75.0
#> [100] 76.0 77.0 78.0 79.0 80.0 81.0 82.0 83.0 84.0 85.0 86.0
#> [111] 87.0 88.0 89.0 90.0 91.0 92.0 93.0 94.0 95.0 96.0 97.0
#> [122] 98.0 99.0 100.0 101.0 102.0 103.0 104.0 105.0 106.0 107.0 108.0
#> [133] 109.0 110.0 111.0 112.0 113.0 114.0 115.0 116.0 117.0 118.0 119.0
#> [144] 120.0 121.0 122.0 123.0 124.0 125.0 126.0 127.0 128.0 129.0 130.0
#> [155] 131.0 132.0 133.0 134.0 135.0 138.0 139.0 140.0 142.0 144.0 145.0
#> [166] 146.0 149.0 150.0
diff(sort(unique(data$LngtClass)))
#> [1] 1.0 1.0 1.0 1.0 1.0 1.0 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
#> [18] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
#> [35] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
#> [52] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
#> [69] 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
#> [86] 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
#> [103] 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
#> [120] 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
#> [137] 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
#> [154] 1.0 1.0 1.0 1.0 1.0 3.0 1.0 1.0 2.0 2.0 1.0 1.0 3.0 1.0
The 0.5-cm length classes occur for only two species, Atlantic Herring (code 126417) and European Sprat (code 126425), as confirmed here (no differences of 0.5 cm remain once they are excluded):
temp = dplyr::filter(data, !(SpecCode %in% c(126417, 126425)))
unique(diff(sort(unique(temp$LngtClass))))
#> [1] 1 3 2
Aside – species names and codes are in specCodeNames (though it needs updating, as there are more species in the data than are listed here):
specCodeNames
#> # A tibble: 111 x 2
#> species speccode
#> <fct> <int>
#> 1 Agonus cataphractus 127190
#> 2 Alosa alosa 126413
#> 3 Alosa fallax 126415
#> 4 Amblyraja radiata 105865
#> 5 Ammodytes marinus 126751
#> 6 Ammodytidae 125516
#> 7 Anarhichas lupus 126758
#> 8 Argentina silus 126715
#> 9 Argentina sphyraena 126716
#> 10 Arnoglossus 126109
#> # ... with 101 more rows
length(unique(specCodeNames$speccode)) # checking speccode are unique
#> [1] 111
Need this to stop earlier groups being kept (can mess up later code):
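The command for this step is not shown here. A likely candidate, and this is an assumption, is dplyr::ungroup(), since the tibble printouts above still report Groups: Year, SpecCode after summarise(). The sketch below uses a small made-up data frame (toy is hypothetical) to show how leftover grouping changes a later summary, and how ungrouping removes it.

```r
library(dplyr)

# Hypothetical toy data, not taken from the IBTS data set
toy = data.frame(Year = c(1986, 1986, 1987),
                 SpecCode = c(1, 1, 2),
                 LngtClass = c(10, 11, 10),
                 Number = c(0.1, 0.2, 0.3))

# Summarising over three grouping variables leaves the first two in place
grouped = summarise(group_by(toy, Year, SpecCode, LngtClass),
                    Number = sum(Number),
                    .groups = "drop_last")
group_vars(grouped)

# A later summary is then done per remaining group, not overall
nGroupedRows = nrow(summarise(grouped, total = sum(Number)))

# After ungrouping, the same summary gives a single overall row
ungrouped = ungroup(grouped)
group_vars(ungrouped)
nUngroupedRows = nrow(summarise(ungrouped, total = sum(Number)))

c(nGroupedRows, nUngroupedRows)
```

With the real data the equivalent call would be data = dplyr::ungroup(data), again assuming that is the step intended here.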
These next commands (which are not run in this vignette) save IBTS_data in the package (they have already been run once to build the package). Rename and save data with a meaningful name for your own data.
So we have the following, where each row is a unique combination of Year, SpecCode and LngtClass (cm, the minimum value of the 1-cm length bin [or 0.5-cm bin for Atlantic Herring and European Sprat]), and Number gives the number of individuals per hour of trawling observed for that combination. Parameters LWa and LWb are the length-weight coefficients for that species from Fung et al. (2012), bodyMass (g) is the resulting estimated body mass for an individual of that species and length class, and Biomass (g h⁻¹) is calculated here as the total biomass caught per hour of trawling for each row.
The resulting Table 1 is:
data_biomass <- dplyr::mutate(data,
Biomass = Number * bodyMass)
knitr::kable(rbind(data_biomass[1:6,],
data_biomass[(nrow(data_biomass)-5):nrow(data_biomass),
]),
digits=c(0, 0, 0, 3, 4, 4, 2, 2))
Year | SpecCode | LngtClass | Number | LWa | LWb | bodyMass | Biomass |
---|---|---|---|---|---|---|---|
1986 | 105814 | 45 | 0.007 | 0.0031 | 3.0290 | 315.46 | 2.25 |
1986 | 105814 | 46 | 0.007 | 0.0031 | 3.0290 | 337.17 | 2.41 |
1986 | 105814 | 50 | 0.007 | 0.0031 | 3.0290 | 434.05 | 3.10 |
1986 | 105814 | 52 | 0.029 | 0.0031 | 3.0290 | 488.81 | 14.33 |
1986 | 105814 | 53 | 0.011 | 0.0031 | 3.0290 | 517.84 | 5.65 |
1986 | 105814 | 54 | 0.011 | 0.0031 | 3.0290 | 548.00 | 6.18 |
2015 | 154675 | 34 | 0.028 | 0.0244 | 2.0439 | 32.93 | 0.92 |
2015 | 274304 | 8 | 0.013 | 0.0080 | 3.1410 | 5.49 | 0.07 |
2015 | 274304 | 14 | 0.039 | 0.0080 | 3.1410 | 31.85 | 1.24 |
2015 | 274304 | 15 | 0.052 | 0.0080 | 3.1410 | 39.55 | 2.05 |
2015 | 274304 | 16 | 0.065 | 0.0080 | 3.1410 | 48.44 | 3.15 |
2015 | 274304 | 17 | 0.013 | 0.0080 | 3.1410 | 58.60 | 0.76 |